Abstract

Sparse signal models have been the focus of much recent research,
leading to (or improving upon) state-of-the-art results in signal,
image, and video restoration. This article extends this line of
research into a novel framework for local image discrimination tasks,
proposing an energy formulation with both sparse reconstruction and
class discrimination components, jointly optimized during dictionary
learning. This approach improves over the state of the art in texture
segmentation experiments using the Brodatz database, and it paves the
way for a novel scene analysis and recognition framework based on
simultaneously learning discriminative and reconstructive
dictionaries. Preliminary results in this direction using examples
from the Pascal VOC06 and Graz02 datasets are presented as well.


1.  Introduction 

Sparse representations have recently drawn much interest in signal,
image, and video processing. Under the assumption that natural images
admit a sparse decomposition in some redundant basis (or so-called
dictionary), several such models have been proposed, e.g., curvelets,
wedgelets, bandlets and various sorts of wavelets [21]. Recent
publications have shown that learning non-parametric dictionaries for
image representation, instead of using off-the-shelf ones, can
significantly improve image restoration, e.g., [6, 30, 31, 38].

Consider a signal $x$ in $\mathbb{R}^n$. We say that $x$ admits a
sparse approximation over a dictionary $D$ in $\mathbb{R}^{n \times k}$,
composed of $k$ elements (atoms), when we can find a linear
combination of a “few” atoms from $D$ that is “close” to the original
signal $x$. A number of practical algorithms have been developed for
learning such dictionaries, like the K-SVD algorithm [2] and the
method of optimal directions (MOD) [7]. This approach has led to
several restoration algorithms, which equal or exceed the state of the
art in tasks such as image and video denoising, inpainting,
demosaicing [6, 19], and texture synthesis [27]. Alternative models
that learn image representations can be found in [10, 12].

^WILLOW project-team, Laboratoire d’Informatique de l’Ecole Nor¬

The computer vision community has also been interested in extracting
sparse information from images for recognition, e.g., by designing
local texture models [15, 37], which have proven to be discriminative
enough to be used in image segmentation [18, 33]. Introducing learning
into the feature extraction task has been part of the motivation for
some recent works: e.g., in [29], image features are learned using
convolutional neural networks, while in [13, 39], discriminative
strategies for learning visual codebooks for nearest-neighbor search
are presented.

Sparse decompositions have also been used for face recognition
[40, 41], signal classification [8, 9] and texture classification
[14, 27, 34]. Interestingly, while discrimination is the main goal of
these papers, the optimization (dictionary design) is purely
generative, based on a criterion that does not explicitly include the
actual discrimination task. Including this task in the optimization is
one of the key contributions of our work.

The framework introduced in this paper addresses the learning of
multiple dictionaries which are simultaneously reconstructive and
discriminative, and the use of the reconstruction errors of these
dictionaries on image patches to derive a pixelwise classification.
The novelty of the proposed approach is twofold: First, redundant
non-parametric dictionaries are learned, in contrast with the more
common use of predefined features and dictionaries [9, 40, 41].
Second, the sparse local representations are learned with an explicit
discriminative goal, making the proposed model very different from
traditional reconstructive ones [8, 27, 34]. We illustrate the
benefits of this approach in a texture segmentation task on the
Brodatz dataset [28], for which it significantly improves over the
state of the art [16, 17, 34], and also present preliminary results
showing that it can be used to learn discriminative key patches from
the Pascal VOC06 [26] database, and perform the weakly supervised form
of feature selection advocated in [25, 36] on images from the Graz02
dataset [24].

Section 2 presents the classical framework for learning dictionaries
for the sparse representation of overlapping image patches. The
discriminative framework is introduced in Section 3, along with the
corresponding optimization procedure. Section 4 is devoted to
experimental results and applications, and Section 5 concludes this
paper.


2.  Dictionary  learning  for  reconstruction 
2.1.  Learning  reconstructive  dictionaries 

We now briefly describe for completeness the K-SVD [2] and MOD [7]
algorithms for learning dictionaries for natural images. Since images
are usually large, these techniques are designed to work with
overlapping patches instead of whole images (one patch centered at
every pixel). We denote by $n$ the number of pixels per patch, and
write patches as vectors $x$ in $\mathbb{R}^n$. Learning an
over-complete dictionary with a fixed number $k$ of atoms, that is
adapted to $M$ patches of size $n$ from natural images, and with a
sparsity constraint (each patch has at most $L$ atoms in its
decomposition), is addressed by solving the following minimization
problem:

$$\min_{D, \alpha} \sum_{l=1}^{M} \|x_l - D\alpha_l\|_2^2 \quad \text{s.t.} \quad \|\alpha_l\|_0 \le L. \tag{1}$$

In this equation, $x_l$ is an image patch written as a column vector.
$D$ in $\mathbb{R}^{n \times k}$ is a dictionary to be learned; each of
its atoms (columns) is a unit vector in the $\ell_2$ norm. The problem
is to find the optimal dictionary that leads to the lowest
reconstruction error given a fixed sparsity factor $L$. The vector
$\alpha_l$ in $\mathbb{R}^k$ is the sparse representation for the
$l$-th patch using the dictionary $D$. $\|x\|_0$ denotes the
“$\ell_0$-norm” of the vector $x$, which is not a formal norm, but a
measure of sparsity counting the number of non-zero elements in a
vector. Then one can introduce

$$\begin{aligned}
\alpha^\star(x, D) &= \arg\min_{\alpha \in \mathbb{R}^k} \|x - D\alpha\|_2^2 \quad \text{s.t.} \quad \|\alpha\|_0 \le L, \\
\mathcal{R}(x, D, \alpha) &= \|x - D\alpha\|_2^2, \\
\mathcal{R}^\star(x, D) &= \|x - D\alpha^\star(x, D)\|_2^2.
\end{aligned} \tag{2}$$

$\mathcal{R}^\star(x, D)$ represents the best representation error of
$x$ on $D$ with a sparsity factor $L$ (number of coefficients in the
representation), in terms of the $\ell_2$-norm. It will be used later
in this paper as a discriminative function.
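
As an illustration of Eq. (2), the following minimal Python sketch approximates $\alpha^\star(x, D)$ with a greedy orthogonal matching pursuit and evaluates the two residual errors. The function names `omp`, `residual_error` and `best_residual_error` are hypothetical, NumPy is assumed to be available, and the greedy solver only approximates the exact (NP-hard) minimizer.

```python
import numpy as np

def omp(x, D, L):
    """Greedy orthogonal matching pursuit with at most L non-zero coefficients.

    x: signal of shape (n,); D: dictionary of shape (n, k) with unit-norm columns.
    Returns a coefficient vector of shape (k,).
    """
    x = np.asarray(x, dtype=float)
    alpha = np.zeros(D.shape[1])
    support = []
    coeffs = np.zeros(0)
    residual = x.copy()
    for _ in range(L):
        if np.linalg.norm(residual) < 1e-12:
            break
        # Pick the atom most correlated with the current residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in support:
            break
        support.append(j)
        # Re-fit all selected coefficients by least squares.
        coeffs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coeffs
    if support:
        alpha[support] = coeffs
    return alpha

def residual_error(x, D, alpha):
    """R(x, D, alpha) = ||x - D alpha||_2^2."""
    return float(np.sum((np.asarray(x, dtype=float) - D @ alpha) ** 2))

def best_residual_error(x, D, L):
    """Approximation of R*(x, D), with OMP standing in for the exact minimizer."""
    return residual_error(x, D, omp(x, D, L))
```

With the identity matrix as a toy dictionary and $L = 2$, the two largest entries of $x$ are kept and the squared remainder is returned as the residual error.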

Both K-SVD and MOD are iterative approaches designed to minimize the
energy (1). First, a dictionary $D$ in $\mathbb{R}^{n \times k}$ is
initialized, e.g., from random patches of natural images. Then the
main loop is composed of two stages:

• Sparse coding: During this step, $D$ is fixed and one addresses the
NP-hard problem of finding the best decomposition of each patch $l$:

$$\alpha_l = \alpha^\star(x_l, D). \tag{3}$$

A greedy orthogonal matching pursuit [22] is used, which has been
shown to be very efficient [35], although it is theoretically
suboptimal.^1

• Dictionary update: Here is the only difference between K-SVD and
MOD. In the case of MOD, the decompositions $\alpha_l$ are fixed and a
least-squares problem is solved, updating all the atoms
simultaneously:

$$\min_{D} \sum_{l=1}^{M} \mathcal{R}(x_l, D, \alpha_l). \tag{4}$$

^1 Other strategies can be employed, like convexification via the
replacement of the $\ell_0$-norm by the $\ell_1$-norm (basis pursuit
approach [5]).


Input: $D \in \mathbb{R}^{n \times k}$ (dictionary from the previous
iteration); $M$ vectors $x_l \in \mathbb{R}^n$ (input data);
$\alpha \in \mathbb{R}^{k \times M}$ (coefficients from the previous
iteration).

Output: $D$ and $\alpha$.

Loop: For $j = 1 \ldots k$, update $d$, the $j$-th column of $D$:

• Select the set of patches that use $d$:

$$\omega \leftarrow \{ l \in 1 \ldots M \mid \alpha_l[j] \neq 0 \}. \tag{5}$$

• For each patch $l \in \omega$, compute the residual of the
decomposition of $x_l$: $r_l = x_l - D\alpha_l$.

• Compute a new atom $d' \in \mathbb{R}^n$ and the associated
coefficients $\beta \in \mathbb{R}^{|\omega|}$ that minimize the
residual error on the selected set $\omega$, using a truncated SVD:

$$\min_{\|d'\|_2 = 1,\ \beta \in \mathbb{R}^{|\omega|}} \sum_{l \in \omega} \|r_l + \alpha_l[j] d - \beta_l d'\|_2^2. \tag{6}$$

• Update $D$ using the new atom $d'$, and replace the scalars
$\alpha_l[j] \neq 0$ from Eq. (5) in $\alpha$ using $\beta$.

Figure 1. K-SVD step for updating $D$ and $\alpha$.
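
The MOD update of Eq. (4) is an ordinary least-squares problem in $D$ once the codes are fixed. A minimal sketch is given below, assuming NumPy; the name `mod_dictionary_update` is hypothetical, and the final rescaling of the atoms to unit norm is an assumption made here to respect the unit-norm convention (it changes the fit unless the codes are rescaled accordingly).

```python
import numpy as np

def mod_dictionary_update(X, A):
    """Least-squares dictionary for fixed codes.

    X: patches as columns, shape (n, M); A: fixed codes, shape (k, M).
    Returns D of shape (n, k) minimizing ||X - D A||_F^2, columns rescaled to unit norm.
    """
    # min_D ||X - D A||_F^2 is the least-squares problem A^T D^T = X^T.
    D_t, *_ = np.linalg.lstsq(A.T, X.T, rcond=None)
    D = D_t.T
    norms = np.linalg.norm(D, axis=0)
    norms[norms == 0] = 1.0          # leave unused (all-zero) atoms untouched
    return D / norms
```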

In the case of K-SVD, the values of the non-zero coefficients in the
$\alpha_l$ are not fixed, and are updated at the same time as the
dictionary $D$. Finding which are the $L$ non-zero coefficients in a
patch decomposition, or in other words which are the atoms that take
part, is a difficult problem that is addressed during the sparse
coding step. On the other hand, setting the values of these non-zero
coefficients, once they are selected, is an easy task, which is
addressed at the same time as the update of $D$. To do so, and as
detailed in Figure 1, each atom (column) of the dictionary and the
coefficients associated with it (the non-zero coefficients in
$\alpha$) are updated sequentially [2]. In practice, it has been
observed that K-SVD converges with fewer iterations than MOD [6],
motivating the selection of K-SVD for our proposed framework (though
MOD could be used instead). Typically 10 or 20 iterations are enough
for K-SVD, with $M = 100\,000$ patches of size $n = 64$ ($8 \times 8$)
and $k = 256$ atoms.
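
The atom update of Figure 1 can be sketched as follows, assuming NumPy and dense arrays; the name `ksvd_atom_update` and the in-place update convention are choices made for this illustration only. The rank-one fit of Eq. (6) is read off the leading singular triplet.

```python
import numpy as np

def ksvd_atom_update(X, D, A, j):
    """Update atom j of D (n x k) and its non-zero coefficients in A (k x M), in place.

    X holds the patches as columns, shape (n, M).
    """
    omega = np.flatnonzero(A[j, :])           # patches that use atom j, Eq. (5)
    if omega.size == 0:
        return
    # Residuals with the contribution of atom j added back: r_l + alpha_l[j] d.
    E = X[:, omega] - D @ A[:, omega] + np.outer(D[:, j], A[j, omega])
    # Best rank-one fit E ~ d' beta^T, Eq. (6), from the leading singular triplet.
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    D[:, j] = U[:, 0]                         # new unit-norm atom d'
    A[j, omega] = s[0] * Vt[0, :]             # new coefficients beta
```

Since the previous atom and coefficients are themselves a feasible rank-one candidate, this step cannot increase the total reconstruction error.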

2.2.  A  reconstructive  approach  to  discrimination 

Assume that we have $N$ sets $S_i$ of training patches,
$i = 1 \ldots N$, belonging to $N$ different classes. The simplest
strategy for using dictionaries for discrimination consists of first
learning $N$ dictionaries $D_i$, $i = 1 \ldots N$, one for each class.
Approximating each patch using a constant sparsity $L$ and the $N$
different dictionaries provides $N$ different residual errors, which
can then be used as classification features. This is essentially the
strategy employed in [27, 34]. Thus, the first naive way of estimating
the class $i_0$ for some patch $x$ is to write (as in [41], but with
learned dictionaries):

$$i_0 = \arg\min_{i = 1 \ldots N} \mathcal{R}^\star(x, D_i). \tag{7}$$

Instead of this reconstruction-based approach, we show that better
results can be achieved, in general, by learning discriminative sparse
representations, while keeping the same robustness against noise or
occlusions (see also [9]), which discriminative methods may be
sensitive to [40].^2
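
Applied at every pixel, the rule of Eq. (7) reduces to an arg-min over the per-class residual errors. A small sketch, assuming the residuals have already been computed and stacked along the first axis (the name `classify_from_residuals` is hypothetical and class indices are 0-based here):

```python
import numpy as np

def classify_from_residuals(residuals):
    """residuals[i, ...] holds R*(x, D_i) for class i; returns the arg-min over i."""
    return np.argmin(np.asarray(residuals, dtype=float), axis=0)
```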

3.  Learning  discriminative  dictionaries 
3.1.  Discriminative  model 

The main goal of our paper is to propose a model that learns
discriminative dictionaries and not only reconstructive ones. We still
want to use the residual error $\mathcal{R}^\star(x, D_i)$ as a
discriminant, which has already been shown to be relatively efficient
in classification [27, 34, 40, 41], but we also want to increase the
discriminative power of $\mathcal{R}^\star(x, D_i)$, using the idea
that a dictionary $D_i$ associated with a class $S_i$ should be “good”
at reconstructing this class, and at the same time “bad” for the other
classes. To this effect, we introduce a discriminative term in
Eq. (1), using classical softmax discriminative cost functions, for
$i = 1 \ldots N$,

$$\mathcal{C}_i^\lambda(y_1, y_2, \ldots, y_N) = \log \left( \sum_{j=1}^{N} e^{-\lambda (y_j - y_i)} \right), \tag{8}$$

which is close to zero when $y_i$ is the smallest value among the
$y_j$, and provides an asymptotic linear penalty cost
$\lambda(y_i - \min_j y_j)$ otherwise. These are the multiclass
versions of the logistic function presented in Figure 2, which is
differentiable and enjoys properties similar to the hinge loss
function from the support vector machine (SVM) literature. Increasing
the value of the new parameter $\lambda > 0$ provides a higher
relative penalty cost for each misclassified patch, at the same time
making the cost function less smooth, in terms of maxima of its second
derivatives.
Denoting by $\{D_j\}_{j=1}^N$ the set of these $N$ dictionaries and by
$\{\mathcal{R}^\star(x, D_j)\}_{j=1}^N$ the set of the $N$ different
reconstruction errors provided by the $N$ dictionaries, we want to
solve

$$\min_{\{D_j\}_{j=1}^N} \sum_{i=1}^{N} \sum_{l \in S_i} \mathcal{C}_i^\lambda\left(\{\mathcal{R}^\star(x_l, D_j)\}_{j=1}^N\right) + \lambda\gamma\, \mathcal{R}^\star(x_l, D_i), \tag{9}$$

where the parameter $\gamma \ge 0$ controls the trade-off between
reconstruction and discrimination. With a high value for $\gamma$, our
model is close to the classical reconstructive one, losing
discriminative power. Note that an optional normalization factor
$1/|S_i|$ in front of each term can be added to ensure the same weight
to each class. Solving this problem will explicitly tend to make each
dictionary $D_i$ not only good for its own class $S_i$, but also
better for $S_i$ than any of the other dictionaries $D_j$, $j \neq i$,
that are being simultaneously learned for the other classes. At first
sight, it might seem that this new optimization problem suffers from
several difficulties. Not only is it non-convex and non-differentiable
because of the $\ell_0$-norm in the function $\mathcal{R}^\star$, but
it also seems to lose the simplicity of possible least-squares-based
(Eq. (4)) or SVD-based (Eq. (6)) dictionary updates. We actually
present next a MOD/K-SVD type of iterative scheme to address this
problem, which includes the automatic tuning of the parameters
$\lambda$ and $\gamma$.

^2 Note also that due to the over-completeness of the dictionaries,
and possible correlations between the classes, sparse representations
of members of one class with dictionaries from a different class can
still be very efficient and produce small values for the residual
error $\mathcal{R}^\star$.

Figure 2. The logistic function (red, continuous), its first
derivative (blue, dashed), and its second derivative (black, dotted).
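
The two regimes of the softmax cost described in this section (near zero for a correctly classified patch, asymptotically linear otherwise) can be checked numerically. The sketch below assumes the form $\mathcal{C}_i^\lambda(y) = \log \sum_j e^{-\lambda (y_j - y_i)}$ and uses a 0-based class index; the name `softmax_cost` is hypothetical.

```python
import numpy as np

def softmax_cost(y, i, lam):
    """C_i^lambda(y_1..y_N) = log sum_j exp(-lam * (y_j - y_i)), computed stably."""
    y = np.asarray(y, dtype=float)
    z = -lam * (y - y[i])
    m = z.max()                    # shift by the maximum to avoid overflow
    return float(m + np.log(np.sum(np.exp(z - m))))
```

For residuals `y = [0.1, 5.0, 7.0]` and `lam = 10`, the cost for class 0 (the smallest residual) is numerically zero, while the cost for class 1 is close to `lam * (y[1] - min(y)) = 49`.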

3.2.  Optimization  procedure 

We want to apply an iterative scheme similar to those effectively used
for the purely reconstructive dictionary learning techniques. First,
one initializes the dictionaries $D_i$ in $\mathbb{R}^{n \times k}$
for $i = 1 \ldots N$, using a few iterations of K-SVD on each class
separately (reconstructive approach). Then the main loop at iteration
$p$ is composed of:

• Sparse coding: This step remains the same. The dictionaries $D_i$
are fixed, and one computes the decompositions
$\alpha_{li} = \alpha^\star(x_l, D_i)$, for each patch $l$ and
dictionary $i$. Note that while this step was directly involved in the
minimization of Eq. (1) in the reconstructive approach, its purpose
here is not to minimize Eq. (9), but to compute the $\alpha^\star$ and
$\mathcal{R}^\star$ that are necessary in the following step.^3

• Dictionary update: This step is the one that will make the
dictionaries more discriminative. Following the same ideas as in
K-SVD, we want to compute new dictionaries $D_i$. This is done by
updating each single atom (column) of each dictionary sequentially,
while letting the corresponding $\alpha$ coefficients, associated with
an atom during the previous sparse coding, change as well. This update
step raises several questions, which are discussed below. The full
procedure is given in Figure 3 at the end of this section.

Choice of the parameters $\lambda$ and $\gamma$. These parameters are
critical to ensure the stability and the efficiency of our proposed
scheme. To automatically choose the value of $\lambda$ that gives the
best result in terms of classification performance, we have opted to
use a varying $\lambda$ and to replace it by an ascending series. At
iteration $p$, the dictionary update step should now use the value
$\lambda_p$. Starting from a low value for $\lambda$, it is possible
to check every few iterations whether increasing this value provides a
better classification rate. Experimentally, this simple idea has
proven to always give faster convergence and significantly better
results than using a fixed parameter.

When $\gamma = 0$, one addresses a pure discriminative task, which can
be difficult to stabilize in our scheme. Indeed, one difference
between our framework and K-SVD is that the sparse coding and
dictionary update steps do not attempt to minimize exactly the same
thing. While the first one addresses the reconstructive problem of
Eq. (2), the second one addresses the (mostly) discriminative learning
task of Eq. (9). To cope with this potential source of instability,
adding some weight (through a bigger $\gamma$) to the reconstructive
term will draw the minimization problem toward the reconstructive (and
stable) one from Eq. (1). Therefore, one has to choose $\gamma$ large
enough to ensure the stability of the scheme, but as small as possible
to enforce the discriminative power.^4 Practically, using a varying
$\gamma$ by building a descending series starting with a high value
and updating the same way as for $\lambda$, gives stability to the
algorithm. To simplify the notation, we will omit these variations of
$\lambda$ and $\gamma$ in the following.

^3 The possible use of a discriminative component, on top of the
traditional reconstructive one, inside the sparse coding stage itself
is the subject of ongoing parallel efforts.

Updating the dictionary in a MOD-like fashion. Let us first consider
how to solve the dictionary update step like in the MOD algorithm,
with varying $\lambda$ and varying reconstruction term $\gamma$ as
detailed above. This minimization problem can be written as

$$\min_{\{D_j\}_{j=1}^N} \sum_{i=1}^{N} \sum_{l \in S_i} \mathcal{C}_i^\lambda\left(\{\mathcal{R}(x_l, D_j, \alpha_{lj})\}_{j=1}^N\right) + \lambda\gamma\, \mathcal{R}(x_l, D_i, \alpha_{li}), \tag{10}$$

with the $\alpha_{li}$ being fixed and computed during the previous
sparse coding step. The proposed MOD-like approach consists of
updating sequentially each dictionary using a truncated Newton
iteration: Let us denote by $C(D)$ the cost function optimized with
respect to a dictionary $D$ in Eq. (10). Performing one Newton
iteration to solve $\nabla C(D) = 0$ is equivalent to minimizing a
second-order Taylor approximation of $C$ (see [3], p. 484). When
computing the Hessian of $C$, we have chosen to do a local linear
approximation^5 of the softmax functions $\mathcal{C}_i^\lambda$ by
neglecting their second derivatives, keeping only the second-order
terms that come from $\mathcal{R}$. This use of an approximated
Hessian in a Newton iteration is indeed an instance of a truncated
Newton iteration, which can be performed very efficiently in our case.

^4 Interestingly, we have observed that this strategy has consequences
that are similar to the ideas that bring robustness to the
Levenberg-Marquardt algorithm [23], as noticed later in this paper.

^5 This kind of approximation is classical in optimization and derives
from the same ideas that motivate the use of Gauss-Newton methods for
nonlinear least-squares.

As already noticed, the softmax functions $\mathcal{C}_i^\lambda$ are
mainly composed of quasi-linear parts, making the local linear
approximation suitable for most of the patches. This is illustrated in
Figure 2 for the logistic function (softmax function with only 2
classes). When patches are correctly classified, their cost is rapidly
close to 0 (right part of the figure). When they are “misclassified
enough,” they lie on a nonconstant linear part of the cost function
(left part of the figure). It can be shown that performing our
truncated Newton iteration to update the $p$-th dictionary $D_p$ is
equivalent to solving

$$\min_{D' \in \mathbb{R}^{n \times k}} \sum_{i=1}^{N} \sum_{l \in S_i} w_l\, \mathcal{R}(x_l, D', \alpha_{lp}), \quad \text{where} \tag{11}$$

$$w_l = \frac{\partial \mathcal{C}_i^\lambda}{\partial y_p}\left(\{\mathcal{R}^\star(x_l, D_j)\}_{j=1}^N\right) + \lambda\gamma\, 1_p(i), \tag{12}$$

the variables $y_p$ are defined in Eq. (8), and $1_p(i)$ is equal to 1
if $i = p$ and 0 otherwise. This formally resembles a weighted least
squares problem, but it is different since some of the weights may be
negative. It admits a unique solution if and only if the Hessian
matrix $A = \sum_{i=1}^{N} \sum_{l \in S_i} w_l \alpha_{lp} \alpha_{lp}^T$
is positive definite. The reconstructive term ($\gamma > 0$) plays an
important role here. By adding weight exclusively to the diagonal of
the matrix $A$, we have experimentally observed that $A$ is almost
always positive definite, which is a key concept behind the classical
Levenberg-Marquardt algorithm for nonlinear least-squares [23].
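
To make Eqs. (11) and (12) concrete, the sketch below computes the weights for one patch and solves the weighted problem for one dictionary. It assumes the softmax cost $\mathcal{C}_i^\lambda(y) = \log \sum_j e^{-\lambda (y_j - y_i)}$, whose partial derivative is $\lambda\,(1_p(i) - s_p)$ with $s_p = e^{-\lambda y_p} / \sum_j e^{-\lambda y_j}$; this closed form is derived here and is not quoted from the text. The names `softmax_weights` and `mod_like_update` are hypothetical, and class indices are 0-based.

```python
import numpy as np

def softmax_weights(y, i, lam, gamma):
    """Weights of Eq. (12) for one patch of class i, for every p = 0..N-1.

    y[p] is the residual error of the patch on dictionary p.
    """
    y = np.asarray(y, dtype=float)
    z = -lam * y
    s = np.exp(z - z.max())
    s /= s.sum()                       # s[p] = exp(-lam y_p) / sum_j exp(-lam y_j)
    w = -lam * s                       # dC_i/dy_p for p != i
    w[i] += lam * (1.0 + gamma)        # extra lam from dC_i/dy_i, plus lam*gamma*1_p(i)
    return w

def mod_like_update(X, A_p, w):
    """Stationary point of sum_l w_l ||x_l - D' a_l||^2 over D'.

    X: (n, M) patches; A_p: (k, M) codes on dictionary p; w: (M,) weights.
    Raises numpy.linalg.LinAlgError if the k x k matrix below is singular.
    """
    H = (A_p * w) @ A_p.T              # sum_l w_l a_l a_l^T
    G = (X * w) @ A_p.T                # sum_l w_l x_l a_l^T
    return np.linalg.solve(H, G.T).T   # D' H = G, with H symmetric
```

The weight for the patch's own class is positive, while the weights for the other classes are negative, which is why the problem is not an ordinary weighted least-squares fit.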

From MOD to K-SVD. Now that we have shown how to use a MOD-like
dictionary update (no change on $\alpha$ during this step), we present
the main ideas behind the use of a K-SVD-like update. Recall that this
sequentially improves each atom of each dictionary, while allowing for
the coefficients $\alpha$ to change (improve) at the same time.
Experimentally, we have observed that this approach converges faster
and gives better results than the MOD-like update in our
discriminative framework. The main idea we use is the local linear
approximation of the softmax function, which provides us with a much
easier function to minimize. The detailed algorithm is presented in
Figure 3. The first two steps are identical to the K-SVD for
reconstruction (Figure 1); these address the selection of the patches
that use the atom being updated. The other steps come from the
following modification of Eq. (11) from the MOD-like dictionary
update. Updating $d$, the $j$-th atom of the dictionary $i$, while
allowing its corresponding coefficients to change, can be written as
Eq. (15), which is a simple eigenvalue problem, where $d'$ is the
eigenvector associated with the largest eigenvalue of

$$B = \sum_{p=1}^{N} \sum_{l \in S_p \cap \omega} w_l (r_l + \alpha_{li}[j] d)(r_l + \alpha_{li}[j] d)^T. \tag{13}$$

This way, for each atom sequentially, we address the problem of
Eq. (9), under the only assumption that the softmax functions can be
locally linearly approximated.

Input: $N$ dictionaries $D_i \in \mathbb{R}^{n \times k}$; $M$ vectors
$x_l \in \mathbb{R}^n$ (input data); $S_i$ (classification of the
input data); $\alpha \in \mathbb{R}^{k \times MN}$ (coefficients from
the previous iteration).

Output: $D$ and $\alpha$.

Loop: For $i = 1 \ldots N$, for $j = 1 \ldots k$, update $d$, the
$j$-th column of $D_i$:

• Select the set of patches that use $d$:

$$\omega \leftarrow \{ l \in 1 \ldots M \mid \alpha_{li}[j] \neq 0 \}. \tag{14}$$

• For each patch $l$ in $\omega$, compute the residual of the
decomposition of $x_l$: $r_l = x_l - D_i\alpha_{li}$.

• Compute the same weights $w_l$ as in Equation (12), for all
$p = 1 \ldots N$, for all $l$ in $S_p \cap \omega$.

• Compute a new atom $d' \in \mathbb{R}^n$ and the associated
coefficients $\beta \in \mathbb{R}^{|\omega|}$ that minimize the
residual error on the selected set $\omega$, using Eq. (13):

$$\min_{\|d'\|_2 = 1,\ \beta \in \mathbb{R}^{|\omega|}} \sum_{p=1}^{N} \sum_{l \in S_p \cap \omega} w_l \|r_l + \alpha_{li}[j] d - \beta_l d'\|_2^2. \tag{15}$$

• Update $D_i$ and $\alpha$ using the new atom $d'$, and replace the
scalars $\alpha_{li}[j] \neq 0$ from Eq. (14) in $\alpha$ using
$\beta$.

Figure 3. Dictionary update step for the proposed discriminative
K-SVD, provided the parameters $\lambda$ and $\gamma$.
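
The weighted atom update of Figure 3 can be sketched as below, assuming NumPy and precomputed per-patch weights `w`. The name `discriminative_atom_update` is hypothetical. The choice $\beta_l = d'^T (r_l + \alpha_{li}[j] d)$ for the new coefficients is an assumption of this sketch: it is the stationary point of Eq. (15) in $\beta$ for a fixed unit-norm $d'$, and the text does not spell out this step.

```python
import numpy as np

def discriminative_atom_update(X, D_i, A_i, j, w):
    """Update atom j of dictionary D_i (n x k) and its coefficients in A_i (k x M), in place.

    X: (n, M) patches; w: (M,) weights, one per patch, as in Eq. (12).
    """
    omega = np.flatnonzero(A_i[j, :])                 # Eq. (14)
    if omega.size == 0:
        return
    # Columns r_l + alpha_li[j] d for the selected patches.
    E = X[:, omega] - D_i @ A_i[:, omega] + np.outer(D_i[:, j], A_i[j, omega])
    B = (E * w[omega]) @ E.T                          # Eq. (13), symmetric n x n
    eigvals, eigvecs = np.linalg.eigh(B)              # eigenvalues in ascending order
    d_new = eigvecs[:, -1]                            # eigenvector of the largest eigenvalue
    D_i[:, j] = d_new
    A_i[j, omega] = d_new @ E                         # assumed choice of beta (see lead-in)
```

With all weights equal to one, the update coincides with the reconstructive K-SVD step of Figure 1, up to the sign of the atom and of its coefficients.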

3.3.  Data  issues 

Before presenting experiments, we discuss here some data-related
issues. Depending on the particular application of the proposed
framework, different prefiltering operations can be applied to the
input data. We mention first the possibility of applying a Gaussian
mask to the input patches (by multiplying the patches element-wise by
the mask), in order to give more weight to the center of the patches,
since the framework is designed for local discrimination. We mention
also the possibility of pre-processing the data using a Laplacian
filter, which has proven to give more discriminatory power. Since a
Laplacian filter can be approximated by a difference of Gaussians,
this step is consistent with previous works on local descriptors. Both
prefiltering operations can be used together, depending on the chosen
application.
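
The Gaussian masking step can be sketched as follows, assuming square patches stored as a NumPy array of shape (M, size, size). The names `gaussian_mask` and `apply_mask` are hypothetical, the mask is left unnormalized (an assumption, since no normalization is specified), and the Laplacian filter is not covered here.

```python
import numpy as np

def gaussian_mask(size, sigma):
    """Unnormalized size x size Gaussian window centred on the patch."""
    r = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(r ** 2) / (2.0 * sigma ** 2))
    return np.outer(g, g)

def apply_mask(patches, mask):
    """Multiply each patch element-wise by the mask and flatten to vectors of length n."""
    patches = np.asarray(patches, dtype=float)
    return (patches * mask).reshape(patches.shape[0], -1)
```

For instance, `apply_mask(patches, gaussian_mask(12, 4.0))` turns a stack of $12 \times 12$ patches into vectors of length 144 in which central pixels carry more weight than corner pixels.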

The presented framework is flexible in the sense that it is very easy
to take into account different types of vectorial information. For
instance, using color patches for the K-SVD has been addressed in [19]
by concatenating R, G, B information from a patch into single vectors.
This can be directly applied here. One could also opt to only include
the mean color of a patch, if we consider that the geometrical
structure is more meaningful for discrimination. It is therefore
possible to work with vectors representing grayscale patches and just
3 average R, G, B values. This makes it possible to take into account
the color information without multiplying the dimensionality of the
patches by 3. Depending on the data, other types of information could
be added this way.
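
A patch vector made of grayscale values plus three mean color values could be built as in the sketch below. The name `gray_plus_mean_color` is hypothetical, and the plain channel average used as the grayscale conversion is an assumption of this illustration.

```python
import numpy as np

def gray_plus_mean_color(patch_rgb):
    """patch_rgb: array of shape (h, w, 3). Returns a vector of length h*w + 3."""
    patch_rgb = np.asarray(patch_rgb, dtype=float)
    gray = patch_rgb.mean(axis=2)                        # assumed grayscale conversion
    mean_rgb = patch_rgb.reshape(-1, 3).mean(axis=0)     # average R, G, B over the patch
    return np.concatenate([gray.ravel(), mean_rgb])
```

A $12 \times 12$ color patch then yields a vector of length 147 instead of the 432 obtained by concatenating the three channels.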

4.  Experimental  results  and  applications 
4.1.  Texture  segmentation  of  the  Brodatz  dataset 

Texture segmentation and classification is a natural application of
our framework, since it can be formulated as a local feature
extraction and patch classification process. We have chosen to
evaluate our method on the Brodatz dataset, introduced in [28], which
provides a set of “patchwork” images composed of textures from
different classes, and a training sample for each class. The suite of
12 images is presented in [17].

In our experiments, following the same methodology as [34], all the
patches of the training images were used as a training set. We use
patches of size $n = 144$ ($12 \times 12$), dictionaries of size
$k = 128$ and a sparsity factor $L = 4$. A Gaussian mask of standard
deviation 4 (element-wise multiplication) and a Laplacian filter were
applied on each patch as a prefiltering. Then 30 iterations of our
discriminative framework are performed. After this training stage,
each one of the $12 \times 12$ patches from the test images is
classified using the learned dictionaries (by comparing the
corresponding representation error $\mathcal{R}^\star$ for each
dictionary). Smoothing follows to obtain the segmentation. We present
two alternatives, either using a simple Gaussian filtering with a
standard deviation of 12, or applying a graph-cut alpha-expansion
algorithm [4, 11], based on a classical Potts model with an
8-neighborhood system. The cost associated to a patch $x$ and a class
$S_i$ is 0 if $x$ has been classified as part of $S_i$, and 1
otherwise. A constant regularization cost between two adjacent patches
of 1.75 has proven to be appropriate. Table 1 reports the results.
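
The Potts energy that the graph-cut regularization minimizes can be written down directly from the costs above. The sketch below only evaluates this energy for a candidate labeling; it does not implement the alpha-expansion algorithm of [4, 11]. The name `potts_energy` and the array layout (one label per patch position) are assumptions of this illustration.

```python
import numpy as np

def potts_energy(labels, patch_labels, beta=1.75):
    """Energy of a candidate segmentation under the Potts model described above.

    labels: candidate label map (H, W); patch_labels: per-patch classification (H, W).
    """
    labels = np.asarray(labels)
    patch_labels = np.asarray(patch_labels)
    # Data term: cost 0 where the label agrees with the patch classification, 1 otherwise.
    data = np.count_nonzero(labels != patch_labels)
    # Pairwise term over the 8-neighbourhood, each unordered pair counted once.
    pairs = np.count_nonzero(labels[:, 1:] != labels[:, :-1])        # horizontal
    pairs += np.count_nonzero(labels[1:, :] != labels[:-1, :])       # vertical
    pairs += np.count_nonzero(labels[1:, 1:] != labels[:-1, :-1])    # main diagonal
    pairs += np.count_nonzero(labels[1:, :-1] != labels[:-1, 1:])    # anti-diagonal
    return data + beta * pairs
```

On a $3 \times 3$ map with a single isolated misclassified patch, keeping the isolated label costs $8 \times 1.75 = 14$, whereas relabeling it costs 1, so the regularized solution removes it.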

From these experiments, we observe that our method significantly
outperforms those reported in [16, 17, 28, 34], regardless of the
selected regularization method, and performs best for all but two of
the images. Moreover, while the purely reconstructive framework
already provides good results, we observe that the discriminative one
noticeably improves the classification rate except for examples 5 and
12. In these two particular cases, this has proven to be an artifact
of the image smoothing. The classification rates before and after
smoothing are indeed not fully correlated. Smoothing will better
remove isolated misclassified patches, and sometimes the misclassified
patches from the reconstructive approach are more isolated than with
the discriminative one. Note that our model under-performs on image 2,
where the patch size we had chosen has proven to be particularly
ill-suited, which is a motivation for developing a multiscale
framework. Some qualitative results are presented in Figure 4.


#     [28]   [17]   [34]   [16]    R1      R2      D1      D2
1     7.2    6.7    5.5    3.37    2.22    1.69    1.89    1.61*
2     18.9   14.3   7.3*   16.05   24.66   36.5    16.38   16.42
3     20.6   10.2   13.2   13.03   10.20   5.49    9.11    4.15*
4     16.8   9.1    5.6    6.62    6.66    4.60    3.79    3.67*
5     17.2   8.0    10.5   8.15    5.26    4.32*   5.10    4.58
6     34.7   15.3   17.1   18.66   16.88   15.50   12.91   9.04*
7     41.7   20.7   17.2   21.67   19.32   21.89   11.44   8.80*
8     32.3   18.1   18.9   21.96   13.27   11.80   14.77   2.24*
9     27.8   21.4   21.4   9.61    18.85   21.88   10.12   2.04*
10    0.7    0.4    NA     0.36    0.35    0.17*   0.20    0.17*
11    0.2*   0.8    NA     1.33    0.58    0.73    0.41    0.60
12    2.5    5.3    NA     1.14    1.36    0.37*   1.97    0.78
Av.   18.4   10.9   NA     10.16   9.97    10.41   7.34    4.50*

Table 1. Error rates for the segmentation/classification task for the
Brodatz dataset. The proposed framework is compared with a number of
reported state-of-the-art results [16, 17, 34] and the best results
reported in [28]. R1 and R2 denote the reconstructive approach, while
D1 and D2 stand for the discriminative one. A Gaussian regularization
has been used for R1 and D1, a graph-cut-based one for R2 and D2. The
best results for each image are marked with an asterisk.


Figure 4. Subset of the Brodatz dataset with various numbers of
classes: From top to bottom, images 4, 7, 9 and 12. The ground-truth
segmentation is displayed in the middle and the resulting segmentation
on the right side with a graph-cut regularization. Note that the
segmentation is in general very precise but fails at separating two
classes on image 7.

4.2. Learning discriminant image patches

To assess the promise of our local appearance model, we verify its
ability to learn discriminative patches for object categories from a
very general database with a high variability. To that effect, we have
chosen some classes from the Pascal VOC06 database [26] and conducted
the following qualitative experiment: Given one object class $A$
(e.g., bicycle, sheep, car, cow), we build two sets of patches $S_1$
and $S_2$. The first one is composed of 200 000 $12 \times 12$ patches
extracted from bounding boxes that contain one object of class $A$.
The second one is composed of 200 000 $12 \times 12$ background
patches from an image that contains one object of class $A$. This way,
classification at the patch level is difficult since the overlap
between $S_1$ and $S_2$ is substantial: (i) many small patches from an
object look like some patches from the background and vice-versa; (ii)
some patches from an object’s bounding boxes do not necessarily fully
overlap with the object.

Analyzing the patches globally as a whole, or using a
multiscale framework as in [1, 36] to capture the global
appearance of objects, are among the possibilities that could
make this problem more tractable, but these are beyond the
scope of this paper. Instead, we want to show here that
our purely local model can deal with this overlap between
classes and learn the local parts of objects from class A
that are discriminative when observed at the patch level.
The experiment consists of learning two discriminative
dictionaries, D1 for S1 and D2 for S2, with k = 128,
L = 4 and 15 iterations of our algorithm, and then pursuing
the discriminative learning for 15 additional iterations
while pruning the set S1 at each new iteration by keeping
the 90% "best classified" patches. This way, one hopes to
remove the overlap between S1 and S2 and to focus the
learning on the key-patches of the objects. Examples with
various classes of the Pascal dataset are presented in
Figure 5. All the images used in our test procedure are
from the official validation set and are not used during
training. The test images are rescaled so that the maximum
of the height and the width of the image is less than
256 pixels. The same prefiltering as for the texture
segmentation is applied and the average color of each patch is
taken into account. The learned key-patches focus on parts
of the object that are locally discriminative compared
to the background. These could eventually be used as inputs
to other algorithms of the "bags-of-words" type. Figure 6
shows examples of the learned dictionaries obtained with
the discriminative and the reconstructive approaches.
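The pruning step described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify how the "best classified" patches are ranked, so the sketch assumes a hypothetical margin score, namely the reconstruction error of a patch on the competing dictionary minus its error on its own dictionary. The function name and its arguments are likewise hypothetical.

```python
def prune_best_classified(err_own, err_other, keep_fraction=0.9):
    """Return the sorted indices of the patches to keep.

    err_own[i]   : reconstruction error of patch i on its own dictionary
    err_other[i] : reconstruction error of patch i on the competing one
    """
    if len(err_own) != len(err_other):
        raise ValueError("error lists must have the same length")
    # Assumed score: a large margin means the patch is reconstructed much
    # better by its own dictionary, i.e. it is confidently classified.
    margins = [other - own for own, other in zip(err_own, err_other)]
    n_keep = max(1, int(round(keep_fraction * len(margins))))
    ranked = sorted(range(len(margins)), key=lambda i: margins[i], reverse=True)
    return sorted(ranked[:n_keep])


# Four toy patches; the third is better explained by the other dictionary.
kept = prune_best_classified([1.0, 2.0, 5.0, 1.0], [3.0, 2.5, 1.0, 4.0],
                             keep_fraction=0.75)
print(kept)
```

If the 90% rule is applied to the current set at each of the 15 additional iterations, roughly 0.9^15 ≈ 21% of the initial patches of S1 remain at the end.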

Figure 5. Learning of key-patches from the Pascal VOC06 dataset.
Column 1 presents the test image. Columns 2, 3 and 4 present the raw
pixelwise classification results obtained respectively at iterations
20, 25 and 30 of the procedure, during the pruning of the dataset.
Interestingly, the vertical and horizontal edges of the bicycles are
not considered as locally discriminative in an urban environment.

4.3. Weakly-supervised feature selection

Using the same methodology as in [25, 36], we quantitatively
evaluate our pixelwise classification on the "bike"
category from the Graz02 dataset [24], in a weakly supervised
fashion (without using any ground truth information
or bounding box during training), with the same parameters,
pre-processing and algorithm as in the previous
subsection (which iteratively prunes the training set after
iteration 15), except that we use a patch size of n = 15 × 15
and process the images at half resolution to capture more
context around each pixel. We use the first 300 images of
the classes "bike" and "background", with the odd images
for training and the even images for testing. To produce
a confidence value per pixel, we measure the reconstruction
errors of the tested patches with different sparsity
factors L = 1, ..., 15 and use these values as feature vectors
in a logistic linear classifier. A Gaussian regularization
similar to that used in our texture segmentation
experiments is applied and noticeably improves the
classification performance. The corresponding
precision-recall curves are presented in Figure 7 and
compared with [25, 36]. As one can see, our algorithm produces
the best results. Nevertheless, a more exhaustive study with
different classes and datasets with a more precise ground
truth would be needed to draw general conclusions about
the relative performance of these three methods.
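The per-pixel confidence described above can be illustrated with a minimal sketch. Several choices here are assumptions rather than details given in this section: sparse coding is done with a plain orthogonal matching pursuit over unit-norm atoms, the feature vector concatenates the error profiles obtained on two dictionaries, and the logistic weights are taken as already trained. All function names are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def omp_residual_energies(x, atoms, max_atoms):
    """Squared residual norm of x after greedily selecting 1..max_atoms atoms.

    atoms is a list of dictionary columns, assumed to have unit norm.
    """
    residual = list(x)
    basis = []      # orthonormal basis of the span of the selected atoms
    energies = []
    for _ in range(max_atoms):
        # Greedy step: pick the atom most correlated with the residual.
        best = max(range(len(atoms)), key=lambda j: abs(dot(atoms[j], residual)))
        # Orthogonalize it against the atoms selected so far (Gram-Schmidt).
        q = list(atoms[best])
        for b in basis:
            c = dot(b, q)
            q = [qi - c * bi for qi, bi in zip(q, b)]
        norm = math.sqrt(dot(q, q))
        if norm > 1e-12:    # skip atoms already in the selected span
            q = [qi / norm for qi in q]
            basis.append(q)
            c = dot(q, residual)
            residual = [ri - c * qi for ri, qi in zip(residual, q)]
        energies.append(dot(residual, residual))
    return energies


def error_features(x, atoms_1, atoms_2, max_atoms=15):
    """Error profiles of x on both dictionaries (assumed feature layout)."""
    return (omp_residual_energies(x, atoms_1, max_atoms)
            + omp_residual_energies(x, atoms_2, max_atoms))


def logistic_confidence(features, weights, bias):
    """Numerically stable logistic function of a linear score."""
    z = bias + dot(weights, features)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)
```

Because the greedy selection is nested, a single pursuit run up to L = 15 yields the errors for all intermediate sparsity levels, so the whole error profile costs no more than the last level alone.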

5. Conclusion and future directions

We have introduced a novel framework for using learned
sparse image representations in local classification tasks.


(a) Reconstructive, bicycle    (b) Reconstructive, background
(c) Discriminative, bicycle    (d) Discriminative, background

Figure 6. Parts of the dictionaries learned on the class ‘bicycle’
from the Pascal VOC06 dataset. The left part has been learned on
bounding boxes containing a bicycle, the right part on background
regions. The resulting dictionaries from the two approaches,
reconstructive and discriminative, are presented. Visually, the
dictionaries produced by the discriminative approach are less similar
to each other than those produced by the reconstructive one.


Figure 7. Precision-recall curve obtained by our framework for the
bikes, without pruning of the training dataset (green, continuous),
and after 5 pruning iterations (red, continuous), compared with
the one from [25] (blue, dashed) and [36] (black, dotted).


Using a local sparsity prior on images, our algorithm learns
the local appearance of object categories in a discriminative
framework. This is achieved via an efficient optimization
of an energy function, leading to the learning of
overcomplete and non-parametric dictionaries that are
explicitly optimized to be both representative and discriminative.
We have shown that the proposed approach leads to
state-of-the-art segmentation results on the Brodatz dataset, with
significant improvements over previously published methods
for most examples. Applied to more general image
datasets, mainly of natural images, it makes it possible to
learn some key-patches of objects and to perform local
discrimination/segmentation.

We are also currently pursuing a discriminative multiscale
analysis. This could be embedded into a graph-cut-based
segmentation framework, which should take into account
both the local classification and the more global image
characteristics, as in [32]. In general, we would like to
build a model that enjoys both global and local image
analysis capabilities, using for example the coefficients of the
decompositions and/or the reconstruction error as local
discriminants, combined with more global learned geometric
constraints between the patches, as currently being
investigated in the scene analysis community.